Compact Features for Detection of Near-Duplicates in Distributed Retrieval

Authors

  • Yaniv Bernstein
  • Milad Shokouhi
  • Justin Zobel
Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
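The abstract does not say how a grainy hash vector is built, so the following is only a rough sketch of the general idea of pruning near-duplicates from a merged result list using a compact per-document signature. It assumes a signature made of a few small min-hashes over the words of a document. The names `signature`, `matches`, and `prune`, and the parameters `N_HASHES`, `BITS`, and `min_matches`, are hypothetical and are not taken from the paper.

```python
import hashlib
from typing import List, Tuple

N_HASHES = 8  # hypothetical: number of small hashes per signature
BITS = 4      # hypothetical: bits kept from each hash


def signature(text: str, n: int = N_HASHES, bits: int = BITS) -> Tuple[int, ...]:
    """Build a compact signature: n seeded min-hashes over the words,
    each truncated to its low `bits` bits (an assumed construction)."""
    words = text.lower().split()
    if not words:
        return tuple(0 for _ in range(n))
    mask = (1 << bits) - 1
    sig = []
    for seed in range(n):
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{w}".encode(), digest_size=8).digest(),
                "big",
            )
            for w in words
        )
        sig.append(smallest & mask)
    return tuple(sig)


def matches(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Count positions at which two signatures agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def prune(results: List[Tuple[str, str]], min_matches: int = 7) -> List[str]:
    """Walk a merged result list of (doc_id, text) pairs in rank order and
    drop any document whose signature agrees with an already kept one
    in at least `min_matches` positions."""
    kept_ids: List[str] = []
    kept_sigs: List[Tuple[int, ...]] = []
    for doc_id, text in results:
        sig = signature(text)
        if any(matches(sig, other) >= min_matches for other in kept_sigs):
            continue  # treated as a duplicate or near-duplicate
        kept_ids.append(doc_id)
        kept_sigs.append(sig)
    return kept_ids
```

In a cooperative setting, each collection could return such a signature alongside each result, so the merging step compares a few bytes per document rather than full text. Fewer bits per hash lower the bandwidth cost but raise the chance of accidental matches, which is why the threshold is a tunable assumption here.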

Similar resources

Pseudometric Approach to Content Based Image Retrieval and Near Duplicates Detection

In this paper we investigate two approaches to content-based image retrieval and their application to near-duplicate detection in image collections. The first approach was proposed by C.E. Jacobs et al. [10]. It involves wavelet transformation of the source image to extract features. The second approach is based on the so-called matrix of brightness variations, which uses signs of partial derivatives of...

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. Errors in data are always possible due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...

Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search

Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the performance of a web search is greatly affected by the flooding of search results with redundant information, i.e., the existence of near-duplicates. Such near-duplicates keep other promising results from reaching users. Many ...

A Hybrid Model for Detection and Elimination of Near-Duplicates Based on Web Provenance for Effective Web Search

Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the voluminous amount of web documents has weakened the performance and reliability of web search engines. Moreover, the existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. These page...

Distributed Text Retrieval From Overlapping Collections

In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple collections; answers to queries are produced by selecting the collections to query and then merging results from these collections. However, in most prior research in the area, collections are assumed to be disjoint. In thi...

Publication date: 2006